MACHINE LEARNING TUTORIALS

ML Tutorials: Overfitting vs. Underfitting


When training a supervised algorithm, splitting the data into training and evaluation datasets involves a trade-off. On one hand, you want your model to have as much information as possible; on the other hand, you want a large enough test sample to evaluate your model's performance effectively. This tutorial will cover the following learning objectives:

  • Overfitting
  • Underfitting
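The train/evaluation split described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library API: the synthetic dataset and the 80/20 split ratio are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: 100 samples, 2 features, a linear target with noise.
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

# Shuffle the indices, then hold out 20% of the rows for evaluation.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))  # 80 20
```

Shuffling before splitting matters: if the rows are ordered (say, by price or date), a straight slice would give the model a biased view of the data, which is exactly the problem discussed above.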

Overfitting




Summary

  • Overfitting occurs when a model fits its training data too closely — for example, when too large a share of the data is used for training and too little is held out for evaluation — resulting in a model that cannot generalize to new data points without breaking the efficacy of the algorithm.
  • In the context of Machine Learning, noise refers to erroneous data points in a dataset, which negatively alter the output of the ML algorithm. Common examples of noise are outliers and data points that were not cleaned properly (e.g., a weight value stored in the age field, resulting in bias).
  • Overfitting can occur when a particular feature has high variance. This can cause the model to fit the regression line closely to some data points while leaving out the rest. High variance typically arises when variables not captured in the dataset affect other features (such as the wind example in the video).
  • One of the most common causes of overfitting is a lack of training data. If you only have 100 data points, that may not be nearly enough for the model to infer new values from the given inputs.
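The points above can be demonstrated with a small NumPy sketch: fitting a high-degree polynomial through a handful of noisy points memorizes the noise, so training error is near zero while error on unseen points is much larger. The quadratic "true" relationship, the noise level, and the polynomial degrees are all assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_quadratic(x, rng):
    # The true relationship is quadratic; the added noise plays the
    # role of the erroneous data points described above.
    return x**2 + rng.normal(scale=0.5, size=x.shape)

x_train = np.linspace(-3, 3, 10)   # only 10 training points
y_train = noisy_quadratic(x_train, rng)
x_test = np.linspace(-3, 3, 50)    # unseen evaluation points
y_test = noisy_quadratic(x_test, rng)

# A degree-9 polynomial through 10 points fits the training data
# exactly, noise included -- the classic overfit.
overfit = np.polyfit(x_train, y_train, deg=9)
# A degree-2 polynomial matches the true underlying trend.
good = np.polyfit(x_train, y_train, deg=2)

def mse(coef, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return float(np.mean((np.polyval(coef, x) - y) ** 2))

print(mse(overfit, x_train, y_train))  # near zero: memorized the noise
print(mse(overfit, x_test, y_test))    # much larger on unseen points
print(mse(good, x_test, y_test))       # close to the noise floor
```

The gap between the overfit model's training error and its evaluation error is the signature of overfitting; the simpler quadratic performs worse on the training set but far better on new data.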

Underfitting




Summary

  • Underfitting occurs when a model is too simple, or is fitted to too little training data relative to the evaluation data, resulting in a model that is too generalized to make an effective prediction.
  • Just like overfitting, underfitting can be caused by noise present in the data. If the data is not cleaned properly, the model will not be able to find the trend that actually exists between features.
  • Bias is a leading cause of underfitting. If you are predicting house prices and 90% of your training data contains homes that are 5,000 sq. ft. or larger, then when the model is asked to predict prices for smaller houses, it will output a garbage value because it is biased toward larger homes.
  • Just like overfitting, underfitting can also be caused by a lack of training data. If you don't have enough data present, your model will be too generalized to make a valid prediction.
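Underfitting can be sketched with the mirror image of the overfitting example: fitting a straight line to data whose true relationship is quadratic. The model is too simple to capture the trend, so its error is large even on the data it was trained on. As before, the quadratic relationship and noise level are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# The true relationship is quadratic, but we will fit a straight
# line -- a model too generalized to capture the trend.
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.3, size=x.shape)

underfit = np.polyfit(x, y, deg=1)  # straight line: underfits
better = np.polyfit(x, y, deg=2)    # matches the true curve

def mse(coef):
    # Mean squared error of a fitted polynomial on the full dataset.
    return float(np.mean((np.polyval(coef, x) - y) ** 2))

print(mse(underfit))  # large even on the data it was trained on
print(mse(better))    # close to the noise variance
```

Note the contrast with overfitting: an overfit model has low training error and high evaluation error, while an underfit model has high error everywhere, because it never learned the trend in the first place.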